feat: record and expose status_reason on model endpoints#840
Open
dm36 wants to merge 3 commits into
Open
Conversation
When an endpoint build fails (status UPDATE_FAILED), the cause was only logged and discarded — the GET model-endpoint API exposed status but no reason, so downstream consumers (e.g. SGP) could only show a generic failure message. Thread a status_reason through the stack so the failure cause is persisted and returned by the API: - add nullable status_reason column to the endpoints table (+ Alembic migration) - add status_reason to the ORM model, ModelEndpointRecord entity, the ORM->entity translation, and the create/update repository methods - expose status_reason on GetModelEndpointV1Response and map it in the use case - capture the cause at the UPDATE_FAILED sites (sanitized build error, image-build failure, infra-delete failure) and clear it when an endpoint returns to READY Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2 tasks
- Don't leak raw infra error text: only surface str(error) when it's a DomainException (the codebase's own user-facing errors); other exceptions (k8s/AWS/DB) fall back to a generic message. Full detail is still logged. - Replace the empty-string-as-clear convention with an explicit clear_status_reason flag so status_reason=None means "unchanged" like every other field in update_model_endpoint_record. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The image-build-failure path wrote a clean status_reason, then raised DockerBuildFailedException — a DomainException whose raw message (including the internal endpoint_id) was surfaced by the outer except handler, overwriting the clean message. Drop the redundant write and instead make the exception message itself the user-safe guidance (it becomes the status_reason via the outer handler); the endpoint_id is logged separately. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
When an endpoint build fails (status
UPDATE_FAILED), the cause is only logged and discarded — the GET model-endpoint API exposesstatusbut no reason. A downstream consumer (Scale's SGP / GenAI platform) polls this endpoint to surface deploy results to users, so today a failed deploy can only show a generic "failed" message with no actionable cause (CUDA OOM, bad checkpoint, image pull failure, etc.).This threads a
status_reasonthrough the stack so the failure cause is persisted and returned by the API:status_reasoncolumn to theendpointstable (+ Alembic migration).status_reasonto the ORM model,ModelEndpointRecordentity, the ORM→entity translation, and the create/update repository methods.status_reasononGetModelEndpointV1Responseand map it in the use case.UPDATE_FAILEDsites (sanitized build error, image-build failure, infra-delete failure) and clear it when an endpoint returns toREADY.Test plan
status_reasonflows through toGetModelEndpointV1Response.Notes
Open to adjusting field naming / sanitization length / which call sites capture a reason to match team conventions. Supersedes #839 (opened from a fork before access was granted).
🤖 Generated with Claude Code
Greptile Summary
This PR threads a
status_reasonfield through the full stack — DB column, ORM, domain entity, repository, use-case, and API response — so that the cause of anUPDATE_FAILEDdeployment is surfaced to consumers instead of being silently discarded.Textcolumnstatus_reasonto theendpointstable via a correctly-chained Alembic migration, and mirrors it in the SQLAlchemy model andModelEndpointRecordentity._status_reason_from_errorhelper gates what is persisted — onlyDomainExceptionmessages (intentionally user-safe) are surfaced; all other exceptions fall back to a generic string. The previous "dead write" beforeDockerBuildFailedExceptionwas raised has been removed, fixing the overwrite issue flagged in an earlier review.clear_status_reason=Trueis threaded through so the column is NULLed when an endpoint returns toREADY.Confidence Score: 5/5
Safe to merge. The change is purely additive: a nullable column with a correct migration, a new optional response field, and narrowly-scoped writes at existing failure paths.
All failure-capture paths are covered, the dead-write from the prior review is resolved, the migration chains correctly, and the DomainException gate prevents raw infra error text from leaking into the API. The one note is that a prior failure reason persists visibly during a subsequent UPDATE_IN_PROGRESS window, but that is a minor UX concern rather than a correctness defect.
No files require special attention. live_endpoint_builder_service.py is the most logic-dense file but the changes there are straightforward.
Important Files Changed
_status_reason_from_errorhelper and wiresstatus_reasoninto the outer exception handler; removes the now-redundant innerupdate_model_endpoint_recordbeforeDockerBuildFailedException. Clears reason on READY but not on UPDATE_IN_PROGRESS, so a prior failure reason can bleed into a retry window.Textcolumnstatus_reasonto theendpointstable with correctdown_revisionchaining to the prior migration (a1b2c3d4e5f6). Upgrade and downgrade are both present and correct.status_reasonto create/update paths. Usesdict_not_noneto skipNone(preserve existing value) andclear_status_reason=Trueto explicitly NULL out the column — design is sound and consistent with other nullable fields.status_reason: Optional[str] = NonetoModelEndpointRecordentity — minimal, correct change.status_reasontoGetModelEndpointV1Response; flows through toListModelEndpointsV1Responseautomatically since it reuses the same DTO.model_endpoint.record.status_reasoninto the response DTO — single-line, correct.status_reasonon infra-delete failure — correct, isolated change.status_reasonflows throughGetModelEndpointByIdV1UseCaseinto the API response.Sequence Diagram
%%{init: {'theme': 'neutral'}}%% sequenceDiagram participant Builder as LiveEndpointBuilderService participant Repo as DbModelEndpointRecordRepository participant DB as endpoints table participant API as GET /model-endpoints/{id} Builder->>Repo: "update(status=UPDATE_IN_PROGRESS)" Repo->>DB: "SET endpoint_status='UPDATE_IN_PROGRESS'" Note over DB: status_reason unchanged (old value persists) alt Build succeeds Builder->>Repo: "update(status=READY, clear_status_reason=True)" Repo->>DB: "SET endpoint_status='READY', status_reason=NULL" else Build fails Builder->>Builder: _status_reason_from_error(error) Note over Builder: DomainException → message; else generic string Builder->>Repo: "update(status=UPDATE_FAILED, status_reason=reason)" Repo->>DB: "SET endpoint_status='UPDATE_FAILED', status_reason=reason" end API->>DB: SELECT status, status_reason FROM endpoints DB-->>API: "{status, status_reason}" API-->>API: "GetModelEndpointV1Response(status_reason=...)"%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%% sequenceDiagram participant Builder as LiveEndpointBuilderService participant Repo as DbModelEndpointRecordRepository participant DB as endpoints table participant API as GET /model-endpoints/{id} Builder->>Repo: "update(status=UPDATE_IN_PROGRESS)" Repo->>DB: "SET endpoint_status='UPDATE_IN_PROGRESS'" Note over DB: status_reason unchanged (old value persists) alt Build succeeds Builder->>Repo: "update(status=READY, clear_status_reason=True)" Repo->>DB: "SET endpoint_status='READY', status_reason=NULL" else Build fails Builder->>Builder: _status_reason_from_error(error) Note over Builder: DomainException → message; else generic string Builder->>Repo: "update(status=UPDATE_FAILED, status_reason=reason)" Repo->>DB: "SET endpoint_status='UPDATE_FAILED', status_reason=reason" end API->>DB: SELECT status, status_reason FROM endpoints DB-->>API: "{status, status_reason}" API-->>API: "GetModelEndpointV1Response(status_reason=...)"Reviews (2): Last reviewed commit: "fix: remove dead status_reason write on ..." | Re-trigger Greptile